On this page we will give you a bird-eye-view of school bus delay in Manhattan from 2018 Fall to 2021 Fall. We aims to explore the trends of the school bus delay occurrence and delay time over the school years. The time period we chose includes both pre and post covid data which will be interesting to look at ! Now lets dive right into it!
First,we would like to get a general view of the frequency of school bus delay in Manhattan from the fall semester 2018 to most recently. As we can see from the plot below, Covid-19 has a huge impact on the school bus delay over the time. Before the pandemic(2018-2020), we can see that the school buses are more likely to delay in the fall semester than the spring semesters especially during December and January. This may be caused by extreme weather in the winter and the heavy traffic during holiday seasons.In 2020-2021 when most schools switched online to cope with the pandemic , we can see there is striking low school bus delay. This is not hard to understand given there are less school bus route running and less cars on streets due to lock down and social distancing. In the fall of 2021, schools are switching back to in-person mode with the effort of vaccine enforcement and all other precautions taken, the school bus delay has a similar distribution as the pre-pandemic distribution.
#read data in
df = read_csv("data/clean_data.csv")
## Rows: 67775 Columns: 34
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (22): month, date_noYear, school_year, run_type, bus_no, route_number, ...
## dbl (10): year, x1, busbreakdown_id, number_of_students_on_the_bus, inciden...
## date (1): occur_date
## time (1): occur_time
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#plot data: drop NA rows
plot_df = df %>%
select(school_year, year, how_long_delayed,occur_date, day, month,busbreakdown_id) %>%
na.omit()
#plot daily delay counts
plot_df %>%
group_by(occur_date) %>%
summarise(delay_count = n(),school_year) %>%
ggplot(aes(x = occur_date, y = delay_count)) +
geom_point(alpha = .3) +
geom_smooth(se = FALSE) +
facet_grid(~school_year,scales = "free") +
labs(
title = "Daily School Bus Delay Counts in Manhattan",
x = "Date",
y = "Counts"
) +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust = 1))
## `summarise()` has grouped output by 'occur_date'. You can override using the `.groups` argument.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Now we want to take a closer look at the counts of school bus delay in each month and compare the school years over 2018 to 2021.
First we can see that except for the 2020-2021 school year , we have a similar trend for school bus delay distribution from January to December. The school bus delay is more frequent in the winter from November to the Next February. The frequency of school bus delay is decreasing in number over the years. We are happy tp see that the NY school bus system is improving on the overall level. The school bus system is probability shut down or has limited routes operating given the lock down in 2020-2021. There is less than 100 delays each month in those days.
#overlaping each year's daily delay counts
overlaping_plot = plot_df %>%
group_by(occur_date) %>%
mutate(delay_count = n(),
day_of_month = lubridate::mday(occur_date),
common_day = lubridate::mdy(paste(month, day_of_month,"2021")))%>%
ggplot(aes(x = common_day, y = delay_count, group = school_year, color = school_year)) +
geom_point(alpha = .3) +
geom_smooth(se = FALSE) +
labs(
title = "School Bus Delay in Manhattan",
x = "Month",
y = "Delay Counts"
) +
scale_x_date(date_breaks = "1 month", labels = function(x) format(x, "%b"))+ guides(fill=guide_legend(title="School Year"))
ggplotly(overlaping_plot)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Except for the frequency of school bus delay, students like us also care about the delay time of the school bus. Surprisingly, we see that the most majority delays take about 31 - 45 minuets before the pandemic.Though we have lower frequency of bus delay after the pandemic, the majority of the bus delay takes about 46-60 minuets. For Student, this might not be as bad as it looks.The overall probability of being late due to school bus delay should decrease, though they may be late for a longer period of time given each delay they run into.
delay_time_dist =
plot_df %>%
ggplot(aes(x = how_long_delayed,fill = how_long_delayed))+
geom_bar(position = "dodge")+
facet_grid(~school_year,scales = "free") +
labs(
title = "Distribution of School Bus Delay Time in Manhattan",
x = "Delay Time",
y = "Counts"
)+
theme(axis.title.x = element_blank(),axis.text.x = element_text(angle = 60, vjust = 0.5, hjust = 0.5)) +
guides(fill=guide_legend(title="Delay Time"))
ggplotly(delay_time_dist)
lets now overlap the school bus delay counts within each school year
Plot4:
df_date_weather_na =
df %>%
drop_na(how_long_delayed) %>%
mutate(
month = forcats::fct_relevel(as.factor(month), "September", "October", "November", "December", "January", "February", "March", "April", "May", "June"),
day = forcats::fct_relevel(as.factor(day), "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
) %>%
filter(!(school_year == "2021-2022"))
df_date_weather_na
## # A tibble: 59,424 x 34
## year occur_date month date_noYear x1 school_year busbreakdown_id run_type
## <dbl> <date> <fct> <chr> <dbl> <chr> <dbl> <chr>
## 1 2018 2018-09-05 Sept~ 09/05 11 2018-2019 1456638 Special~
## 2 2018 2018-09-05 Sept~ 09/05 12 2018-2019 1456639 Special~
## 3 2018 2018-09-05 Sept~ 09/05 46 2018-2019 1456670 Special~
## 4 2018 2018-09-05 Sept~ 09/05 47 2018-2019 1456671 Special~
## 5 2018 2018-09-05 Sept~ 09/05 51 2018-2019 1456675 Special~
## 6 2018 2018-09-05 Sept~ 09/05 52 2018-2019 1456676 Special~
## 7 2018 2018-09-05 Sept~ 09/05 58 2018-2019 1456681 Special~
## 8 2018 2018-09-05 Sept~ 09/05 75 2018-2019 1456698 Special~
## 9 2018 2018-09-05 Sept~ 09/05 79 2018-2019 1456702 Special~
## 10 2018 2018-09-05 Sept~ 09/05 83 2018-2019 1456706 Special~
## # ... with 59,414 more rows, and 26 more variables: bus_no <chr>,
## # route_number <chr>, reason <chr>, schools_serviced <chr>,
## # occur_time <time>, created_on <chr>, boro <chr>, bus_company_name <chr>,
## # how_long_delayed <chr>, number_of_students_on_the_bus <dbl>,
## # has_contractor_notified_schools <chr>,
## # has_contractor_notified_parents <chr>, have_you_alerted_opt <chr>,
## # informed_on <chr>, incident_number <dbl>, last_updated_on <chr>, ...
df_date_weather_plot =
df_date_weather_na %>%
group_by(month, how_long_delayed, school_year) %>%
summarize(n_obs = n()) %>%
mutate(how_long_delayed = forcats::fct_relevel(as.factor(how_long_delayed), "0-15 Min", "16-30 Min", "31-45 Min", "46-60 Min", "61-90 Min")
)
## `summarise()` has grouped output by 'month', 'how_long_delayed'. You can override using the `.groups` argument.
p1 = ggplot(df_date_weather_plot, aes(x = month, y = n_obs)) +
geom_col(alpha = 0.6) +
facet_grid(~how_long_delayed) +
geom_line(aes(group = school_year, color = school_year)) +
labs(
x = "month",
y = "counts",
title = "the monthly distribution for different length of time"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
ggplotly(p1)
Plot5:
df_date_weather_plot2 =
df_date_weather_na %>%
group_by(day, school_year, how_long_delayed) %>%
summarize(n_obs = n())
## `summarise()` has grouped output by 'day', 'school_year'. You can override using the `.groups` argument.
p2 = ggplot(df_date_weather_plot2, aes(x = day, y = n_obs)) +
geom_col(alpha = 0.6) +
facet_grid(~school_year) +
labs(
x = "weekday",
y = "counts",
title = "the daily distribution for different length of time"
) +
geom_line(aes(group = how_long_delayed, color = how_long_delayed)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
ggplotly(p2)